The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions for describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from the multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. As for the caption task, we propose a novel topic-aware decoder to generate more accurate and detailed video descriptions with guidance from the latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. The results of extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.